Predicting Mental Health Problems from Social Determinants and Caregiving Activities of Caregivers of Persons with Dementia: A Machine Learning Approach

BMIN503/EPID600 Final Project

Author

Hannah Cho


Updated 12/3/24

0.1 Overview

After consulting with Drs. Demiris and Huang from the School of Nursing, we decided to explore the intersection of social determinants, caregiving activities, and mental health outcomes in caregivers of persons with dementia. Our goal is to understand how these factors contribute to caregivers’ mental health challenges, with a particular focus on identifying predictive patterns through a machine learning approach.

0.2 Introduction

The challenges faced by dementia caregivers are complex and multifaceted, encompassing not only the physical and emotional demands of caregiving but also social, financial, and systemic barriers. These challenges are not only physically demanding, such as providing around-the-clock care, assisting with activities of daily living, and managing the various health complications of dementia, but they also take a significant toll on caregivers' emotional well-being. Caregivers often experience stress, anxiety, depression, and a sense of isolation, as the demands of caregiving can leave little room for self-care, personal time, or social engagement. Caregivers of persons with dementia are at particularly high risk of anxiety and depression due to the nature of the disease: the trajectory of dementia is highly variable, depending on individuals' pre-existing conditions and multimorbidity. In addition to this uncertainty, caregivers often face significant social and systemic barriers that limit their access to essential support services. These barriers include financial strain, a lack of accessible respite care, insufficient knowledge about available resources, and cultural or social stigma associated with caregiving. Many caregivers also experience isolation due to a lack of social engagement.

Research Question: Which social strains and sociodemographic characteristics of caregivers most strongly predict anxiety and depression among caregivers of persons living with dementia, and how accurately can supervised machine learning models predict these outcomes?

0.3 Methods

Dataset: This study uses data from the National Health and Aging Trends Study (NHATS) Round 11 and the National Study of Caregiving (NSOC) Round 4, both collected in 2021. The NHATS is a publicly accessible dataset that includes a nationally representative sample of adults aged 65 years and older who are Medicare beneficiaries in the United States. The NSOC is conducted alongside the NHATS; participants in the NSOC are caregivers for older adults included in the NHATS. Both the NHATS and the NSOC were funded by the National Institute on Aging (R01AG062477; U01AG032947). When used together, the NHATS and NSOC provide valuable information on dyads of older adults receiving care and their family caregivers.

Samples: Persons with dementia: Probable dementia was identified based on one of the following criteria: a self-reported diagnosis of dementia or Alzheimer's disease by a physician, a score of 2 or higher on the AD8 screening instrument administered to proxy respondents, or a score 1.5 standard deviations or more below the mean on a range of cognitive tests. Caregivers: Caregivers were identified from the NSOC and NHATS datasets. Since this project specifically aims to explore caregivers of persons with dementia living in the community, the sample was further filtered by dementia classification (demclas) and residency (r11dresid).

After retrieving NHATS Round 11 and NSOC Round 4, I selected the sample (using the NHATS R11 variable r11demclas) and then merged the necessary datasets.

#Load necessary packages (install them first if needed).
library(haven) #for dta file
library(dplyr) 

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2) #for data visualization 
#Bring datasets 
df1 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NHATS_Round_11_SP_File_V2.dta") # dementia classification in this file
df2 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NSOC_r11.dta") #caregiver information 1
df3 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NSOC_cross.dta") #caregiver information 2
df4 <- read_dta("~/R  HC/BMIN503_Final_Project/final final/NHATS_Round_11_OP_File.dta") #older adults information 
#need to clean df1 first in order to classify dementia classes  
#ENTER WHICH ROUND?
sp1 <- df1 |>
  mutate(rnd = 11) 

#3. EDIT ROUND NUMBER INSIDE THE QUOTES 
#(THIS REMOVES THE PREFIXES ON NEEDED VARIABLES ) 
sp1 <- sp1 |>
  rename_all(~stringr::str_replace(.,"^r11","")) |>
  rename_all(~stringr::str_replace(.,"^hc11","")) |>
  rename_all(~stringr::str_replace(.,"^is11","")) |>
  rename_all(~stringr::str_replace(.,"^cp11","")) |> 
  rename_all(~stringr::str_replace(.,"^cg11",""))

#ADD R1DAD8DEM AND SET TO -1 FOR ROUND 1 BECAUSE THERE IS NO PRIOR DIAGNOSIS IN R1
sp1 <- sp1 |>
  mutate(dad8dem = ifelse(rnd == 1, -1, dad8dem))



#SUBSET NEEDED VARIABLES
df<-sp1 |> 
  dplyr::select(spid, rnd, dresid, resptype, disescn9, chgthink1, chgthink2, chgthink3, chgthink4, chgthink5, chgthink6, chgthink7, chgthink8, dad8dem,
                speaktosp, todaydat1, todaydat2, todaydat3, todaydat4, todaydat5, presidna1, presidna3, vpname1, vpname3, quesremem, dclkdraw, atdrwclck, 
                dwrdimmrc, dwrdlstnm, dwrddlyrc)

#FIX A ROUND 2 CODING ERROR#
df <- df |>
  mutate(dwrdimmrc = ifelse(dwrdimmrc==10 & dwrddlyrc==-3 & rnd==2, -3, dwrdimmrc))

#CREATE SELECTED ROUND DEMENTIA CLASSIFICATION VARIABLE 
df <- df |>
  mutate(demclas  =  ifelse(dresid==3 | dresid==5 | dresid==7, -9, #SET MISSING (RESIDENTIAL CARE FQ ONLY) AND N.A. (NURSING HOME RESIDENTS, DECEASED)
                            ifelse((dresid==4 & rnd==1) | dresid==6 | dresid==8, -1,                #SET MISSING (RESIDENTIAL CARE FQ ONLY) AND N.A. (NURSING HOME RESIDENTS, DECEASED)
                                   ifelse((disescn9==1 | disescn9==7) &           #CODE PROBABLE IF DEMENTIA DIAGNOSIS REPORTED BY SELF OR PROXY*
                                            (resptype==1 | resptype==2), 1, NA))))

#CODE AD8_SCORE*
#INITIALIZE COUNTS TO NOT APPLICABLE*
#ASSIGN VALUES TO AD8 ITEMS IF PROXY AND DEMENTIA CLASS NOT ALREADY ASSIGNED BY REPORTED DIAGNOSIS 
for(i in 1:8){
  df[[paste("ad8_", i, sep = "")]]  <- as.numeric(ifelse(df[[paste("chgthink", i, sep = "")]]==2 & df$resptype==2 & is.na(df$demclas), 0, #PROXY REPORTS NO CHANGE
                                                         ifelse((df[[paste("chgthink", i, sep = "")]]==1 | df[[paste("chgthink", i, sep = "")]] == 3) & df$resptype==2 & is.na(df$demclas), 1, #PROXY REPORTS A CHANGE OR ALZ/DEMENTIA*
                                                                ifelse(df$resptype==2 & is.na(df$demclas), NA, -1))))    #SET TO NA IF IN RES CARE AND demclass=., OTHERWISE AD8 ITEM IS SET TO NOT APPLICABLE                                                                                                                        
}

#INITIALIZE COUNTS TO NOT APPLICABLE*
for(i in 1:8){
  df[[paste("ad8miss_", i, sep = "")]]  <- as.numeric(ifelse(is.na(df[[paste("ad8_", i, sep = "")]]), 1,
                                                             ifelse((df[[paste("ad8_", i, sep = "")]]==0 | df[[paste("ad8_", i, sep = "")]]==1) & df$resptype==2 & is.na(df$demclas), 0, -1)))
}

for(i in 1:8){
  df[[paste("ad8_", i, sep = "")]] <- as.numeric(ifelse(is.na(df[[paste("ad8_", i, sep = "")]]) & is.na(df$demclas) & df$resptype==2, 0, df[[paste("ad8_", i, sep = "")]]))
}

#COUNT AD8 ITEMS
#ROUNDS 2+
df <- df |>
  mutate(ad8_score = ifelse(resptype==2 & is.na(demclas), (ad8_1 + ad8_2 + ad8_3 + ad8_4 + ad8_5 + ad8_6 + ad8_7 + ad8_8), -1)) %>% 
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8 
  mutate(ad8_score = ifelse(dad8dem==1 & resptype==2 & is.na(demclas), 8, ad8_score))  %>% 
  #SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8 FOR ROUNDS 4-9
  mutate(ad8_score = ifelse(resptype==2 & dad8dem==-1 & chgthink1==-1 & (rnd>=4 & rnd<=9) & is.na(demclas) , 8, ad8_score)) 

#COUNT MISSING AD8 ITEMS
df <- df |> 
  mutate(ad8_miss = ifelse(resptype==2 & is.na(demclas),(ad8miss_1+ad8miss_2+ad8miss_3+ad8miss_4+ad8miss_5+ad8miss_6+ad8miss_7+ad8miss_8), -1))

#CODE AD8 DEMENTIA CLASS 
#IF SCORE>=2 THEN MEETS AD8 CRITERIA
#IF SCORE IS 0 OR 1 THEN DOES NOT MEET AD8 CRITERIA
df <- df |> 
  mutate(ad8_dem = ifelse(ad8_score>=2, 1,
                          ifelse(ad8_score==0 | ad8_score==1 | ad8_miss==8, 2, NA)))

#UPDATE DEMENTIA CLASSIFICATION VARIABLE WITH AD8 CLASS
df <- df |> 
  #PROBABLE DEMENTIA BASED ON AD8 SCORE  
  mutate(demclas = ifelse(ad8_dem==1 & is.na(demclas), 1, 
                          #NO DIAGNOSIS, DOES NOT MEET AD8 CRITERION, AND PROXY SAYS CANNOT ASK SP COGNITIVE ITEMS*
                          ifelse(ad8_dem==2 & speaktosp==2 & is.na(demclas), 3, demclas)))


####CODE DATE ITEMS AND COUNT 
#CODE ONLY YES/NO RESPONSES: MISSING/NA CODES -1, -9 LEFT MISSING*
#2: NO/DK OR -7: REFUSED RECODED TO : NO/DK/RF*
#****ADD NOTES HERE ABOUT WHAT IS HAPPENING IN ROUNDS 1-3, 5+ VS. ROUND 4 
#*
for(i in 1:5){
  df[[paste("date_item", i, sep = "")]]  <- as.numeric(ifelse(df[[paste("todaydat", i, sep = "")]]==1, 1,
                                                              ifelse(df[[paste("todaydat", i, sep = "")]]==2 | df[[paste("todaydat", i, sep = "")]]== -7, 0, NA)))
}

#COUNT CORRECT DATE ITEMS
df <- df |>
  mutate(date_item4 = ifelse(rnd==4, date_item5, date_item4)) %>% 
  mutate(date_sum = date_item1 + date_item2 + date_item3 + date_item4) %>% 
  
  #PROXY SAYS CAN'T SPEAK TO SP
  mutate(date_sum = ifelse(speaktosp==2 & is.na(date_sum),-2,  
                           #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER*
                           ifelse((is.na(date_item1) | is.na(date_item2) | is.na(date_item3) | is.na(date_item4)) & speaktosp==1,-3, date_sum))) %>% 
  
  #MISSING IF PROXY SAYS CAN'T SPEAK TO SP*  
  mutate(date_sumr = ifelse(date_sum == -2 , NA, 
                            #0 IF SP UNABLE TO ANSWER*
                            ifelse(date_sum == -3 , 0, date_sum)))


########PRESIDENT AND VICE PRESIDENT NAME ITEMS AND COUNT########## 
##CODE ONLY YES/NO RESPONSES: MISSING/N.A. CODES -1,-9 LEFT MISSING *
##2:NO/DK OR -7:REFUSED RECODED TO 0:NO/DK/RF*
df <- df |>
  mutate(preslast = ifelse(presidna1 == 1, 1,
                           ifelse(presidna1 == 2 | presidna1 == -7, 0, NA))) |> 
  mutate(presfirst = ifelse(presidna3 == 1, 1,
                            ifelse(presidna3 == 2 | presidna3 == -7, 0, NA))) |> 
  mutate(vplast = ifelse(vpname1 == 1, 1,
                         ifelse(vpname1 == 2 | vpname1 == -7, 0, NA))) |> 
  mutate(vpfirst = ifelse(vpname3 == 1, 1,
                          ifelse(vpname3 == 2 | vpname3 == -7, 0, NA))) |> 
  
  #COUNT CORRECT PRESIDENT/VP NAME ITEMS*
  mutate(presvp = preslast + presfirst + vplast + vpfirst) |> 
  #PROXY SAYS CAN'T SPEAK TO SP 
  mutate(presvp = ifelse(speaktosp == 2 & is.na(presvp), -2, 
                         #PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER                           
                         ifelse((is.na(preslast) | is.na(presfirst) | is.na(vplast) | is.na(vpfirst)) & speaktosp==1 & is.na(presvp),-3, presvp))) |> 
  
  #MISSING IF PROXY SAYS CAN’T SPEAK TO SP*
  mutate(presvpr =  ifelse(presvp == -2 , NA, 
                           ifelse(presvp == -3 , 0, presvp))) |> 
  
  #ORIENTATION DOMAIN: SUM OF DATE RECALL AND PRESIDENT/VP NAMING* 
  mutate(date_prvp = date_sumr + presvpr)


#######EXECUTIVE FUNCTION DOMAIN: CLOCK DRAWING SCORE##########
#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUND 10 ONLY)* 
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd==10, -2,
                           ifelse(speaktosp==1 & (quesremem==2 | quesremem==-7 | quesremem==-8) & dclkdraw==-9 & rnd==10, -3,
                                  ifelse(atdrwclck==2 & dclkdraw==-9 & rnd==10, -4,
                                         ifelse(atdrwclck==97 & dclkdraw==-9 & rnd==10, -7, dclkdraw)))))

#RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUNDS 11 AND FORWARD ONLY)* 
df<-df  |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd>=11, -2, 
                           ifelse(speaktosp == 1 & (quesremem == 2 | quesremem == -7 | quesremem == -8) & dclkdraw == -9 & rnd>=11, -3, dclkdraw))) 
df<-df  |>
  mutate(clock_scorer = ifelse(dclkdraw == -3 | dclkdraw == -4 | dclkdraw == -7, 0,
                               #IMPUTE MEAN SCORE TO PERSONS MISSING A CLOCK*
                               #IF PROXY SAID CAN ASK SP*
                               ifelse(dclkdraw == -9 & speaktosp == 1, 2, 
                                      #IF SELF-RESPONDENT*       
                                      ifelse(dclkdraw == -9 & speaktosp == -1, 3, 
                                             ifelse(dclkdraw == -2 | dclkdraw == -9, NA, dclkdraw)))))


#MEMORY DOMAIN: IMMEDIATE AND DELAYED WORD RECALL 
df <- df |>
  mutate(irecall  =  ifelse(dwrdimmrc == -2 | dwrdimmrc == -1, NA,
                            ifelse(dwrdimmrc == -7 | dwrdimmrc == -3, 0, dwrdimmrc))) |> 
  mutate(irecall = ifelse(rnd==5 & dwrddlyrc==-9, NA, irecall)) |>  #round 5 only: set cases with missing word list and not previously assigned to missing
  
  mutate(drecall  =  ifelse(dwrddlyrc == -2 | dwrddlyrc == -1, NA,
                            ifelse(dwrddlyrc == -7 | dwrddlyrc == -3, 0, dwrddlyrc))) |> 
  mutate(drecall = ifelse(rnd==5 & dwrddlyrc==-9, NA, drecall)) |>  #round 5 only: set cases with missing word list and not previously assigned to missing
  
  mutate(wordrecall0_20 = irecall+drecall)


#CREATE COGNITIVE DOMAINS FOR ALL ELIGIBLE 

df<-df |> 
  mutate(clock65 = ifelse(clock_scorer == 0 | clock_scorer==1, 1, 
                          ifelse(clock_scorer > 1 & clock_scorer<6, 0, NA)))

df<-df |>  
  mutate(word65 = ifelse(wordrecall0_20 >= 0 & wordrecall0_20 <=3, 1, 
                         ifelse(wordrecall0_20 > 3 & wordrecall0_20 <=20, 0, NA)))

df<-df |>  
  mutate(datena65 = ifelse(date_prvp >= 0 & date_prvp <=3, 1, 
                           ifelse(date_prvp > 3 & date_prvp <= 8, 0, NA)))

#  *CREATE COGNITIVE DOMAIN SCORE*
df<-df |> 
  mutate(domain65 = clock65+word65+datena65)

#*SET CASES WITH MISSING WORD LIST AND NOT PREVIOUSLY ASSIGNED TO MISSING (ROUND 5 ONLY)
df<-df |>   
  mutate(demclas = ifelse(rnd==5 & dwrdlstnm==-9 & is.na(demclas), -9, demclas))

#UPDATE COGNITIVE CLASSIFICATION*
df<-df |> 
  #PROBABLE DEMENTIA
  mutate(demclas = ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & (domain65==2 | domain65==3), 1,
                          #POSSIBLE DEMENTIA
                          ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65==1, 2,
                                 #NO DEMENTIA                    
                                 ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65==0, 3, demclas))))

#KEEP VARIABLES AND SAVE DATA
df<-df |> 
  dplyr::select(spid, rnd, demclas)

#CHANGE # AFTER "r" TO THE ROUND OF INTEREST
r11demclas <- df

#4. NAME AND SAVE DEMENTIA DATA FILE:
#CHANGE # AFTER "r" TO THE ROUND OF INTEREST
save(r11demclas, file = "~/R  HC/BMIN503_Final_Project/final final/NHATS_r11.RData") #save() writes R's native format, so an .RData extension is used rather than .dta

Merging the datasets

#merged datasets (md). 
md1 <- left_join(df, df1, by = "spid")
md2 <- left_join(md1, df3, by = "spid")

# choose probable dementia and dementia patients who live at home
dementia1 <- md2 |>
  filter(demclas %in% c("1", "2") & (r11dresid  %in% c("1")))

dementia2 <- md2 |>
  filter(demclas %in% c("1", "2") & (r11dresid  %in% c("1", "2")))

Predictors: Caregiver-level factors include the caregiver's age, race, gender, self-reported income, and highest education level; these were recoded as shown in the code chunks. The caregivers' highest education level was collapsed into three categories. For economic status, the caregivers' reported income from the previous year was used. This study included both informal and formal support as part of the caregivers' social determinants of health. Informal support included having friends or family (a) to talk to about important life matters, (b) to help with daily activities, such as running errands, and (c) to assist with care provision [10]. Formal support included (a) participation in a support group for caregivers, (b) access to respite services that allowed the caregiver to take time off, and (c) involvement in a training program that assisted the caregiver in providing care for the care recipient [10]. These individual items were used as support questions, and each was answered by indicating whether or not the caregiver received that support.

# Caregiver's Age (renaming variable) #chd11dage
#Race
# Recode `race` to create a new binary variable
# 1 for "White, non-Hispanic" and 0 for "Non-White"
# Recode race to create a new variable 'race_recode'
dementia1 <- dementia1 |>
  mutate(
    race_recode = case_when(
      crl11dcgracehisp == 1 ~ 0,  # White, non-Hispanic
      crl11dcgracehisp == 2 ~ 1, #black, non-hispanic
      crl11dcgracehisp == 3 ~ 2,  # others
      crl11dcgracehisp == 4 ~ 3,
      chd11educ %in% c(5, 6) ~ NA_real_, # Missing or not applicable
      TRUE ~ NA_real_                      # Unhandled cases# hispanics
    ))  

table(dementia1$crl11dcgracehisp)

  1   2   3   4   5   6 
250 193  16  46   1  22 
table(dementia1$race_recode)

  0   1   2   3 
250 193  16  46 
# Gender: Male as reference (0), Female as 1
dementia1 <- dementia1 |>
  mutate(
    gender_recode = case_when(
      as.character(c11gender) == "1" ~ 0,  # Male
      as.character(c11gender) == "2" ~ 1,  # Female
      TRUE ~ NA_real_  # Handle any other unexpected cases
    )
  )

# Education: Recoding education levels into three categories
table(dementia1$chd11educ)

 -8  -7  -6   1   2   3   4   5   6   7   8   9 
 13   1   2   1   5  36 132  35 101  50  83  69 
dementia1 <- dementia1 |>
  mutate(
    edu_recode = case_when(
      chd11educ %in% c(1, 2, 3) ~ 1,       # high school diploma or below
      chd11educ %in% c(4, 5) ~ 1,          # some college
      chd11educ %in% c(6, 7, 8) ~ 2,       # college and beyond
      chd11educ == 9 ~ 3,  
      chd11educ %in% c(-8, -7, -6) ~ NA_real_, # missing or not applicable
      TRUE ~ NA_real_                      # unhandled cases
    )
  )
table(dementia1$edu_recode)

  1   2   3 
209 234  69 
# Marital Status: Recoding marital status into binary (married vs. not married)
dementia1 <- dementia1 |>
  mutate(
    martstat_recode = case_when(
      chd11martstat == 1 ~ 0,     # Married
      chd11martstat %in% 2:6 ~ 1, # Not married (single, divorced, etc.)
      chd11martstat %in% c(-8, -6) ~ NA_real_, # Missing or not applicable
      TRUE ~ NA_real_                      # Unhandled cases
    )
  )

table(dementia1$martstat_recode)

  0   1 
204 232 

Caregivers’ caregiving activities

#recoding: 1 (every day), 2 (most days), 3 (some days) -> 1; 4 (rarely), 5 (never) -> 0
#C11 CA1 HOW OFT HELP WITH CHORES (cca11hwoftchs)
#C11 CA2 HOW OFTEN SHOPPED FOR SP (cca11hwoftshp)
#C11 CA6 HOW OFT HELP PERS CARE (cca11hwoftpc)
#C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F)
#C11 CA7 HOW OFT HLP GTNG ARD HOME (cca11hwofthom)
#C11 CA9 HOW OFTEN DROVE SP (cca11hwoftdrv)
#C11 CA10 OFTN WENT ON OTH TRANSPR (cca11hwoftott)

library(dplyr)
dementia1 <- dementia1 |>
  mutate(across(
    c(cca11hwoftchs, cca11hwoftshp, cca11hwoftpc, 
      cca11hwofthom, cca11hwoftdrv, cca11hwoftott), 
    ~ case_when(
        . %in% c(1, 2, 3) ~ 1,  # every day / most days / some days
        . %in% c(4, 5) ~ 0,     # rarely / never
        TRUE ~ NA_real_
      ),
    .names = "{.col}_recoded"
  ))
  

#C11 CA3 EVER HELP ORDER MEDICINES (cca11hlpordmd) - yes 1 no 2-->0
#C11 CA5A GO ONLINE billing or banking (cca11hlpbnkng)
#C11 CA5A GO ONLINE SHOP FR GROC (cca11cmpgrcry)
#C11 CA5B GO ONLINE ORDER MEDS (cca11cmpordrx )
#C11 CA5C GO ONLINE BILLS BANKNG (cca11cmpbnkng )
#C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F) (cca11hlpteeth) 
#cdc11hlpyrpls 

dementia1 <- dementia1 |>
  mutate(across(
    c(cca11hlpordmd, cca11hlpbnkng, cca11cmpgrcry, 
      cca11cmpordrx, cca11cmpbnkng, cca11hlpteeth,cdc11hlpyrpls),
    ~ case_when(
        . == 1 ~ 1,
        . == 2 ~ 0,
        . %in% c(-8, -6, -1) ~ NA_real_,
        TRUE ~ NA_real_
      ),
    .names = "{.col}_recoded"
  ))

#caregiver's caregiving activities
#cca11hwoftchs #C11 CA1 HOW OFT HELP WITH CHORES
#cca11hwoftshp #C11 CA2 HOW OFTEN SHOPPED FOR SP
#cca11hwoftpc  #C11 CA6 HOW OFT HELP PERS CARE
#cca11hwofthom #C11 CA7 HOW OFT HLP GTNG ARD HOME
#cca11hwoftdrv #C11 CA9 HOW OFTEN DROVE SP
#cca11hwoftott #C11 CA10 OFTN WENT ON OTH TRANSPR

1 Caregiver’s Features

che11enrgylmt #Energy often limited
cac11diffphy  #Caregiver physical difficulty helping
cac11exhaustd #Caregiver exhausted at night
cac11toomuch  #Care more than can handle
cac11uroutchg #Care routine often changes
cac11notime   #No time for self (-8,-7,-6)
cac11diffemlv #Caregiver emotional difficulty (-1, 1-5)
cpp11hlpkptgo #Kept from going out (-6,-1, 1,2)
che11health   #General health (-8, 1-5)
che11sleepint #Interrupted sleep (-8,-7,-6, 1-5)
op11numhrsday #Number of hours of help per day (-7,-1, 1-6)
op11numdaysmn #Number of days of help per month (-7,-1, 1-6)

#Sociodemographic features (persons living with dementia and caregiver)
table(dementia1$cac11notime)

op11leveledu  #Caregiver education (NA -1, 1-5)
cac11diffinc  #Caregiver financial difficulties (-8,-7,-6, binary 1,2)
ew11progneed1 #Person living with dementia received food stamps (-8,-7, binary)
ew11finhlpfam #Person living with dementia received financial help from family (-8,-7, binary)
mc11havregdoc #Person living with dementia has a regular doctor (1,2, binary)
hc11hosptstay #Person living with dementia had a hospital stay in the last 12 months (1,2)
hc11hosovrnht #Person living with dementia: number of hospital stays (-7,-1, 1-6 times)

Outcomes: Caregivers' anxiety and depressive symptoms were each measured with two questions. First, anxiety was measured with the Generalized Anxiety Disorder-2 (GAD-2) scale, which consists of two questions. Since the NSOC provided GAD-2 data, this study used it to measure anxiety levels among caregivers. Each item on the scale is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6. Higher scores correspond to greater anxiety, with a total GAD-2 score of 3 or more indicating anxiety.

The caregivers' depression was evaluated using the Patient Health Questionnaire-2 (PHQ-2) scale. Given that the NSOC included the PHQ-2, this study used it to measure depression in caregivers. Each item is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6, with higher scores indicating more severe depression. A score of 3 has been identified as the optimal cutpoint when screening for depression with the PHQ-2: if the score is 3 or greater, major depressive disorder is likely.

# Sum the two questions for GAD2
dementia1$total_gad2 <- dementia1$che11fltnervs + dementia1$che11fltworry
# Recode the combined variable using a cut-off of 3
dementia1$gad2_cg_cat <- ifelse(dementia1$total_gad2 < 3, 0, 1)
 table(dementia1$gad2_cg_cat)

  0   1 
279 249 
 summary(dementia1$gad2_cg_cat) #1 ~ anxiety
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.4716  1.0000  1.0000     251 
# Sum of the two questions for PHQ2 (che11fltltlin + che11fltdown) 
dementia1$total_phq2 <- dementia1$che11fltltlin+ dementia1$che11fltdown
#Recode the combined variable using a cut-off of 3
dementia1$phq2_cg_cat <- ifelse(dementia1$total_phq2 < 3, 0, 1)
 table(dementia1$phq2_cg_cat)

  0   1 
276 252 
 summary(dementia1$phq2_cg_cat) #1 ~ depression
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.4773  1.0000  1.0000     251 

Data analysis

For data analysis, we first conducted descriptive analyses, including means, standard deviations, ranges, and percentages, to summarize the dataset. To investigate how caregivers’ social strains and caregiver-level factors influence caregiver depression, we performed logistic regression analyses. Guided by the conceptual framework of this study, univariate logistic regression analyses were employed to identify caregivers’ social strains and caregiver-level factors significantly associated with caregiver anxiety and depression, controlling for care recipient-level factors. Variables with a p-value below 0.05 in the univariate analyses were included in the subsequent multivariate logistic regression model. The multivariate model was then constructed to determine which factors most strongly influenced caregiver anxiety and depression. All statistical analyses were conducted using R, with statistical significance set at a p-value of less than 0.05.
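The univariate screen followed by a multivariate model can be sketched in a few lines of base R. The data frame, outcome, and predictor names below are synthetic placeholders rather than the NHATS/NSOC variables, and with pure-noise predictors the screen may legitimately retain nothing.

```r
# Illustrative sketch of the univariate -> multivariate screening strategy.
# `dat`, `outcome`, and `x1`-`x3` are synthetic stand-ins for the study data.
set.seed(42)
dat <- data.frame(
  outcome = rbinom(200, 1, 0.5),
  x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200)
)
predictors <- c("x1", "x2", "x3")

# Univariate screen: fit outcome ~ x for each candidate, keep p < 0.05
keep <- predictors[sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "outcome"), data = dat, family = binomial)
  coef(summary(fit))[2, "Pr(>|z|)"] < 0.05
})]

# Multivariate model on whatever survived the screen
if (length(keep) > 0) {
  final_fit <- glm(reformulate(keep, response = "outcome"), data = dat, family = binomial)
  print(summary(final_fit)$coefficients)
} else {
  message("no predictor passed the univariate screen")
}
```

In the actual analysis, `outcome` would be `gad2_cg_cat` or `phq2_cg_cat` and `predictors` the recoded caregiver variables.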

1.1 Data Preparation

#creating subset to work more effectively
library(dplyr)
selected_vars <- c(
  "spid","demclas", "cca11hwoftchs", "cca11hwoftshp", "cca11hwoftpc",
  "cca11hwofthom", "cca11hwoftdrv", "cca11hwoftott",
  "che11enrgylmt", "cac11diffphy", "cac11exhaustd",
  "cac11toomuch", "cac11uroutchg", "cac11notime",
  "cac11diffemlv", "cpp11hlpkptgo", "che11health", 
  "che11sleepint", 
   "ew11progneed1", "ew11finhlpfam", 
  "mc11havregdoc", "hc11hosptstay", "hc11hosovrnht", 
  "cac11diffinc", "pa11hlkepfvst", "pa11hlkpfrclb", "pa11hlkpgoenj", "pa11hlkpfrwrk", "pa11hlkpfrvol", "pa11prcranoth", "race_recode", "gender_recode", "edu_recode", "chd11dage", "martstat_recode", "phq2_cg_cat", "gad2_cg_cat"
)

dementia_subset <- dementia1 |> select(all_of(selected_vars))
head(dementia_subset)
# A tibble: 6 × 37
      spid demclas cca11hwoftchs    cca11hwoftshp    cca11hwoftpc  cca11hwofthom
     <dbl>   <dbl> <dbl+lbl>        <dbl+lbl>        <dbl+lbl>     <dbl+lbl>    
1 10000036       1 NA               NA               NA            NA           
2 10000041       2  1 [1 EVERY DAY]  1 [1 EVERY DAY]  1 [1 EVERY …  1 [1 EVERY …
3 10000041       2  5 [5 NEVER]      4 [4 RARELY]     5 [5 NEVER]   3 [3 SOME D…
4 10000051       1  1 [1 EVERY DAY]  3 [3 SOME DAYS]  2 [2 MOST D…  3 [3 SOME D…
5 10000064       1  2 [2 MOST DAYS]  5 [5 NEVER]      3 [3 SOME D…  3 [3 SOME D…
6 10000064       1  1 [1 EVERY DAY]  1 [1 EVERY DAY]  5 [5 NEVER]   3 [3 SOME D…
# ℹ 31 more variables: cca11hwoftdrv <dbl+lbl>, cca11hwoftott <dbl+lbl>,
#   che11enrgylmt <dbl+lbl>, cac11diffphy <dbl+lbl>, cac11exhaustd <dbl+lbl>,
#   cac11toomuch <dbl+lbl>, cac11uroutchg <dbl+lbl>, cac11notime <dbl+lbl>,
#   cac11diffemlv <dbl+lbl>, cpp11hlpkptgo <dbl+lbl>, che11health <dbl+lbl>,
#   che11sleepint <dbl+lbl>, ew11progneed1 <dbl+lbl>, ew11finhlpfam <dbl+lbl>,
#   mc11havregdoc <dbl+lbl>, hc11hosptstay <dbl+lbl>, hc11hosovrnht <dbl+lbl>,
#   cac11diffinc <dbl+lbl>, pa11hlkepfvst <dbl+lbl>, pa11hlkpfrclb <dbl+lbl>, …
# Select only numeric columns
numeric_dementia1 <- dementia_subset |> select(where(is.numeric))

# Compute the correlation matrix
cor_matrix <- cor(numeric_dementia1, method = "kendall", use = "pairwise.complete.obs")
# Filter correlations above a threshold
high_corr <- which(abs(cor_matrix) > 0.8 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr_pairs <- data.frame(
  Var1 = rownames(cor_matrix)[high_corr[, 1]],
  Var2 = colnames(cor_matrix)[high_corr[, 2]],
  Correlation = cor_matrix[high_corr]
)
print(high_corr_pairs)
           Var1          Var2 Correlation
1 cca11hwoftdrv cca11hwoftdrv   1.0000000
2 cca11hwoftott cca11hwoftott   1.0000000
3 hc11hosptstay hc11hosptstay   1.0000000
4 hc11hosovrnht hc11hosptstay  -0.9391286
5 hc11hosptstay hc11hosovrnht  -0.9391286
6 pa11prcranoth pa11prcranoth   1.0000000
7 gender_recode gender_recode   1.0000000
# Install and load pheatmap if necessary
library(pheatmap)

# Create the heatmap
pheatmap(cor_matrix, color = colorRampPalette(c("blue", "white", "red"))(50))

Since my outcomes are GAD-2 and PHQ-2, I am most focused on the variables surrounding them in the heatmap: che11enrgylmt, che11health, cac11diffemlv, hc11hosovrnht, martstat_recode, race_recode, mc11havregdoc, gender_recode, cca11hwoftott, ia11totinc, ew11finhlpfam, and ew11progneed1.

A correlation matrix was computed and presented as a heatmap to visually assess the data and evaluate the independence of the variables. A correlation coefficient quantifies the association between two variables, with values ranging from −1 to +1: a coefficient of 0 indicates no correlation, while negative and positive values represent negative and positive correlations, respectively. Because many of the items are ordinal, Kendall's rank correlation was used here rather than Pearson's coefficient. A p-value of < .05 was used to define statistical significance. I then selected the most representative variable from each group of highly correlated items to reduce conceptual redundancy. Negative (missing-data) codes in the dataset were not addressed in this step.
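The redundancy-reduction step, keeping one representative from each group of highly correlated variables, can be sketched as follows. The toy variables `a`, `b`, and `c`, the 0.8 threshold, and the keep-the-first-drop-the-second rule are illustrative assumptions.

```r
# Sketch: for each variable pair with |tau| > 0.8, keep the first member
# and drop the second. `a`, `b`, `c` are synthetic; `b` nearly duplicates `a`.
set.seed(1)
m <- data.frame(a = rnorm(100))
m$b <- m$a + rnorm(100, sd = 0.05)  # near-duplicate of `a`
m$c <- rnorm(100)                   # independent variable

cm <- cor(m, method = "kendall", use = "pairwise.complete.obs")
drop_vars <- character(0)
for (i in seq_len(ncol(cm) - 1)) {
  for (j in (i + 1):ncol(cm)) {
    if (abs(cm[i, j]) > 0.8 && !(colnames(cm)[i] %in% drop_vars)) {
      drop_vars <- union(drop_vars, colnames(cm)[j])  # drop the second member
    }
  }
}
m_reduced <- m[, setdiff(colnames(m), drop_vars), drop = FALSE]
colnames(m_reduced)  # "a" "c"
```

In practice the choice of which member of a correlated pair to keep is a substantive judgment (e.g. keeping hc11hosptstay over hc11hosovrnht), not a mechanical one.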

1.1.1 Feature selection

The original dataset includes numerous negative and N/A values, and the sample size is small, necessitating preprocessing before feature selection. To handle the small sample size, negative values were transformed into N/A values. These N/A values were then imputed based on the type of feature. For continuous variables, N/A values were replaced with the median of the respective column, while the most frequent category level was used to impute N/A values for categorical features.

che11enrgylmt, che11health, cac11diffemlv, hc11hosovrnht, martstat_recode, race_recode, mc11havregdoc, gender_recode, cca11hwoftott, ia11totinc, ew11finhlpfam, ew11progneed1.

#converting negative values (NHATS/NSOC missing-data codes) to NA
dementia_subset[dementia_subset < 0] <- NA

#imputation
continuous_vars <- sapply(dementia_subset, is.numeric)  # find numeric (continuous) columns
dementia_subset[continuous_vars] <- lapply(dementia_subset[continuous_vars], function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))

# categorical: replace NA with the most frequent level (the mode)
categorical_vars <- sapply(dementia_subset, is.factor)  # find factor columns
dementia_subset[categorical_vars] <- lapply(dementia_subset[categorical_vars], function(x) {
  mode_level <- names(sort(table(x), decreasing = TRUE))[1]  # most frequent level
  x[is.na(x)] <- mode_level
  x
})
#final cleaning for dataset
#choosing only one caregiver for each participant
final <- dementia_subset |>
  group_by(spid) |>
  slice_head(n = 1) |>
  ungroup()

#total 563 caregivers 

#creating new dataset for each outcome
anxiety <- subset(
  final,
  select = c(gad2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott, che11enrgylmt, 
             cac11diffphy, cac11exhaustd, cac11diffemlv, che11health, che11sleepint, ew11progneed1, race_recode, gender_recode, edu_recode, 
             chd11dage, martstat_recode)
)

depression <- subset(
  final,
  select = c(phq2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott, che11enrgylmt, 
             cac11diffphy, cac11exhaustd, cac11diffemlv, che11health, che11sleepint, ew11progneed1, race_recode, gender_recode, edu_recode, 
             chd11dage, martstat_recode)
)

1.2 Results

This exploratory study employs multiple analytic techniques, including correlation matrix analysis, logistic regression via generalized linear models (GLM), and random forest (RF), to identify key predictors of caregiver depression and anxiety. This multipronged approach suits the diverse data types in this study. Machine learning methods, being inductive, support hypothesis generation and allow systematic feature reduction by excluding variables deemed unimportant across multiple methods. This refines the feature set, enhancing the interpretability and predictive accuracy of the models.

1.2.1 GLM

1.2.1.1 Anxiety

# Create training/testing data (seed set first so the split is reproducible)
set.seed(123)
anxiety_split <- initial_split(anxiety, 
                            prop = 0.80)
anxiety_split #<450/113/563>
<Training/Testing/Total>
<450/113/563>
anxiety_train <- training(anxiety_split)
anxiety_test <- testing(anxiety_split)

#glm
set.seed(123)
lr_class_spec<-logistic_reg()|>
  set_engine("glm")

anxiety_train$gad2_cg_cat <- as.factor(anxiety_train$gad2_cg_cat)
anxiety_test$gad2_cg_cat <- as.factor(anxiety_test$gad2_cg_cat)

lr_class_fit <- lr_class_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)

## top variables
significant_predictors <- tidy(lr_class_fit$fit) |>
  filter(p.value < 0.05)
print(significant_predictors)
# A tibble: 7 × 5
  term            estimate std.error statistic       p.value
  <chr>              <dbl>     <dbl>     <dbl>         <dbl>
1 cac11diffphy      -1.04      0.385     -2.70 0.00685      
2 cac11exhaustd     -1.40      0.241     -5.78 0.00000000735
3 che11health        0.362     0.176      2.05 0.0399       
4 che11sleepint     -0.440     0.191     -2.30 0.0213       
5 race_recode       -0.436     0.196     -2.23 0.0260       
6 gender_recode     -0.839     0.338     -2.49 0.0129       
7 martstat_recode   -0.867     0.374     -2.31 0.0207       
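Since these estimates are on the log-odds scale, they are easier to read as odds ratios. A quick illustrative conversion using two of the estimates above (the exponentiation is standard; the substantive reading depends on how each item is coded):

```r
# Convert selected log-odds estimates from the anxiety GLM to odds ratios
est <- c(cac11exhaustd = -1.40, che11health = 0.362)
odds_ratios <- exp(est)
round(odds_ratios, 2)
# cac11exhaustd ~ 0.25: a one-unit increase multiplies the odds of the
# modeled class by about a quarter; che11health ~ 1.44 raises them by ~44%
```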
### 
anxiety_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(gad2_cg_cat ~ .)


# Note: this refits the model on the test data itself, so the predictions
# below are in-sample and optimistic; the held-out evaluation further down
# uses lr_class_fit, which was trained on the training set
anxiety_glm_fit <- anxiety_glm_wf |> 
  fit(data = anxiety_test)

# Generate predictions with probabilities
anxiety_glm_predicted <- predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")

anxiety_glm_predicted 
# A tibble: 113 × 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1 0.970    0.0303
 2 0.941    0.0590
 3 0.509    0.491 
 4 0.700    0.300 
 5 0.00881  0.991 
 6 0.545    0.455 
 7 0.970    0.0303
 8 0.970    0.0303
 9 0.970    0.0303
10 0.970    0.0303
# ℹ 103 more rows
# Combine into a single data frame
anxiety_glm_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,  # Actual values of the outcome variable
  predict(anxiety_glm_fit, new_data = anxiety_test),  # Predicted class labels
  predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)

print(anxiety_glm_pred_values)
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0           0.970    0.0303
 2 0     0           0.941    0.0590
 3 1     0           0.509    0.491 
 4 0     0           0.700    0.300 
 5 1     1           0.00881  0.991 
 6 0     0           0.545    0.455 
 7 0     0           0.970    0.0303
 8 0     0           0.970    0.0303
 9 0     0           0.970    0.0303
10 0     0           0.970    0.0303
# ℹ 103 more rows
# 20-fold cross-validation
anxiety_folds<-vfold_cv(anxiety_train, v=20)

glm_workflow<-workflow()|>
  add_model(lr_class_spec)|>
  add_formula(gad2_cg_cat ~ .)

glm_fit_cv<-glm_workflow|>
  fit_resamples(anxiety_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.815    20  0.0194 Preprocessor1_Model1
2 brier_class binary     0.135    20  0.0129 Preprocessor1_Model1
3 roc_auc     binary     0.868    20  0.0194 Preprocessor1_Model1
#roc_auc = 0.868
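The brier_class metric reported above is the mean squared difference between the predicted probability and the observed outcome, where 0 is perfect and 0.25 matches an uninformative 50/50 guess. A tiny computation with made-up probabilities shows the idea:

```r
# Brier score on four hypothetical predictions
p <- c(0.9, 0.2, 0.7, 0.4)  # predicted P(class = 1)
y <- c(1,   0,   1,   1)    # observed class
brier <- mean((p - y)^2)
brier  # 0.125
```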

anxiety_glm_cv_preds<-collect_predictions(glm_fit_cv)
anxiety_glm_cv_preds |>
  group_by(id) |>
  roc_curve(truth = gad2_cg_cat, .pred_0) |>
  autoplot()

#Prediction on the test data
anxiety.lr.pred.values.test <-  bind_cols(
  truth = anxiety_test$gad2_cg_cat,
  predict(lr_class_fit, anxiety_test),
  predict(lr_class_fit, anxiety_test, type = "prob")
)
anxiety.lr.pred.values.test
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.929  0.0710
 2 0     0             0.958  0.0423
 3 1     0             0.807  0.193 
 4 0     0             0.636  0.364 
 5 1     1             0.334  0.666 
 6 0     1             0.437  0.563 
 7 0     0             0.929  0.0710
 8 0     0             0.929  0.0710
 9 0     0             0.929  0.0710
10 0     0             0.929  0.0710
# ℹ 103 more rows
#Plot of ROC curve of prediction on test results
autoplot(roc_curve(anxiety.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(anxiety.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.832
2 kap         binary         0.523
3 mn_log_loss binary         0.349
4 roc_auc     binary         0.9  
#roc_auc = 0.900

1.2.1.2 Depression

# Create training/testing data (seed set first so the split is reproducible)
set.seed(123)
depression_split <- initial_split(depression, 
                            prop = 0.80)
depression_split #<450/113/563>
<Training/Testing/Total>
<450/113/563>
depression_train <- training(depression_split)
depression_test <- testing(depression_split)

#glm
set.seed(123)
lr_class_spec<-logistic_reg()|>
  set_engine("glm")

depression_train$phq2_cg_cat <- as.factor(depression_train$phq2_cg_cat)
depression_test$phq2_cg_cat <- as.factor(depression_test$phq2_cg_cat)

lr_class_fit <- lr_class_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)

## top variables
significant_predictors_depression <- tidy(lr_class_fit$fit) |>
  filter(p.value < 0.05)
print(significant_predictors_depression)
# A tibble: 4 × 5
  term          estimate std.error statistic     p.value
  <chr>            <dbl>     <dbl>     <dbl>       <dbl>
1 cac11exhaustd   -1.22      0.237     -5.16 0.000000252
2 race_recode     -0.601     0.196     -3.07 0.00213    
3 gender_recode   -0.658     0.335     -1.97 0.0494     
4 edu_recode      -0.632     0.246     -2.57 0.0101     
### 
depression_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(phq2_cg_cat ~ .)

# Note: this refits the model on the test data itself, so the predictions
# below are in-sample and optimistic; the held-out evaluation further down
# uses lr_class_fit, which was trained on the training set
depression_glm_fit <- depression_glm_wf |> 
  fit(data = depression_test)

# Generate predictions with probabilities
depression_glm_predicted <- predict(depression_glm_fit, new_data = depression_test, type = "prob")

depression_glm_predicted 
# A tibble: 113 × 2
   .pred_0 .pred_1
     <dbl>   <dbl>
 1  0.0864  0.914 
 2  0.0417  0.958 
 3  0.958   0.0417
 4  0.182   0.818 
 5  0.947   0.0526
 6  0.947   0.0526
 7  0.947   0.0526
 8  0.0141  0.986 
 9  0.947   0.0526
10  0.947   0.0526
# ℹ 103 more rows
# Combine into a single data frame
depression_glm_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,  # Actual values of the outcome variable
  predict(depression_glm_fit, new_data = depression_test),  # Predicted class labels
  predict(depression_glm_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)

print(depression_glm_pred_values)
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 1     1            0.0864  0.914 
 2 1     1            0.0417  0.958 
 3 0     0            0.958   0.0417
 4 1     1            0.182   0.818 
 5 0     0            0.947   0.0526
 6 0     0            0.947   0.0526
 7 0     0            0.947   0.0526
 8 1     1            0.0141  0.986 
 9 0     0            0.947   0.0526
10 0     0            0.947   0.0526
# ℹ 103 more rows
# 20-fold cross-validation
depression_folds<-vfold_cv(depression_train, v=20)

depression_glm_wf <- workflow ()|>
  add_model(lr_class_spec)|>
  add_formula(phq2_cg_cat ~ .)

depression_glm_fit_cv<-depression_glm_wf |>
  fit_resamples(depression_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(depression_glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.789    20  0.0225 Preprocessor1_Model1
2 brier_class binary     0.144    20  0.0109 Preprocessor1_Model1
3 roc_auc     binary     0.829    20  0.0279 Preprocessor1_Model1
#roc_auc = 0.829

depression_glm_cv_preds<-collect_predictions(depression_glm_fit_cv)
depression_glm_cv_preds |>
  group_by(id) |>
  roc_curve(truth = phq2_cg_cat, .pred_0) |>
  autoplot()

#Prediction on the test data
depression.lr.pred.values.test <-  bind_cols(
  truth = depression_test$phq2_cg_cat,
  predict(lr_class_fit, depression_test),
  predict(lr_class_fit, depression_test, type = "prob")
)
depression.lr.pred.values.test
# A tibble: 113 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 1     0             0.805  0.195 
 2 1     1             0.280  0.720 
 3 0     0             0.841  0.159 
 4 1     0             0.846  0.154 
 5 0     0             0.918  0.0817
 6 0     0             0.918  0.0817
 7 0     0             0.918  0.0817
 8 1     1             0.234  0.766 
 9 0     0             0.918  0.0817
10 0     0             0.918  0.0817
# ℹ 103 more rows
#Plot of ROC curve of prediction on test results
autoplot(roc_curve(depression.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(depression.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.796
2 kap         binary         0.485
3 mn_log_loss binary         0.407
4 roc_auc     binary         0.884
#roc_auc = 0.884

1.2.2 Random Forest (RF)

1.2.2.1 Anxiety

library(tune)
library(parsnip)
library(recipes)
library(rsample)
library(workflows)
library(vip)        # for variable-importance plots

rf_spec<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

rf_fit<-rf_spec|>
  fit(gad2_cg_cat ~ ., data=anxiety_train)

## top variables
rf_fit|>
  extract_fit_engine()|>
  vip()

rf_fit|>
  extract_fit_engine()|>
  importance()|>
  as.data.frame()|>
  arrange(desc(MeanDecreaseGini))
                        0            1 MeanDecreaseAccuracy MeanDecreaseGini
cac11exhaustd   18.160438 28.610875520            26.602243        24.220057
chd11dage       16.715676  8.228809762            18.528249        18.940090
che11sleepint   15.662060 17.942512544            21.041412        16.055477
che11health     19.310556 -0.001371936            17.494747        11.733905
cac11diffphy    18.406689 12.979238174            20.710571        11.716736
race_recode      6.901280 14.793121062            11.430766        10.468367
cca11hwoftchs    8.024048  3.456235186             8.771303         8.433153
cac11diffemlv   12.998192  7.006421241            14.135251         7.835222
che11enrgylmt   18.790061  9.946644459            19.939498         7.590405
cca11hwoftshp    9.736861 -3.384750499             6.982284         7.143561
edu_recode       9.703614 -8.132997303             5.397654         5.385959
martstat_recode  6.638045  3.630095419             7.566522         3.599143
cca11hwoftott   12.350941 -5.412171090             8.648917         3.360422
gender_recode    5.239941 -0.306576740             4.604304         2.943583
ew11progneed1    5.582404 -1.110006962             3.256633         2.321722
# 20-fold cross-validation for RF
anxiety_folds<-vfold_cv(anxiety_train, v=20)
rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(gad2_cg_cat ~ .)

rf_fit_cv <- rf_workflow |>
  fit_resamples(anxiety_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(rf_fit_cv) 
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.826    20 0.0240  Preprocessor1_Model1
2 brier_class binary     0.113    20 0.00912 Preprocessor1_Model1
3 roc_auc     binary     0.899    20 0.0153  Preprocessor1_Model1
#roc_auc = 0.899
anxiety_rf_cv_preds<-collect_predictions(rf_fit_cv)
anxiety_rf_cv_preds|>
  group_by(id)|>
  roc_auc(truth = gad2_cg_cat, .pred_0)
# A tibble: 20 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.922
 2 Fold02 roc_auc binary         0.928
 3 Fold03 roc_auc binary         0.863
 4 Fold04 roc_auc binary         0.925
 5 Fold05 roc_auc binary         0.814
 6 Fold06 roc_auc binary         1    
 7 Fold07 roc_auc binary         0.95 
 8 Fold08 roc_auc binary         1    
 9 Fold09 roc_auc binary         0.810
10 Fold10 roc_auc binary         0.889
11 Fold11 roc_auc binary         0.906
12 Fold12 roc_auc binary         0.847
13 Fold13 roc_auc binary         0.75 
14 Fold14 roc_auc binary         0.821
15 Fold15 roc_auc binary         0.944
16 Fold16 roc_auc binary         0.941
17 Fold17 roc_auc binary         0.986
18 Fold18 roc_auc binary         0.947
19 Fold19 roc_auc binary         0.846
20 Fold20 roc_auc binary         0.897
# Plot ROC curve
anxiety_rf_cv_preds |>
  group_by(id)|>
  roc_curve(truth = gad2_cg_cat, .pred_0) |>
  autoplot()

# Fit the random forest model on the full training data
anxiety_rf_fit <- rf_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)

anxiety_rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 17.56%
Confusion matrix:
    0  1 class.error
0 293 33   0.1012270
1  46 78   0.3709677
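The out-of-bag (OOB) error rate above can be verified directly from the confusion matrix: misclassified cases divided by all training cases. A quick check using the printed counts:

```r
# Rows are truth, columns are OOB predictions (counts from the output above)
cm <- matrix(c(293, 46, 33, 78), nrow = 2,
             dimnames = list(truth = c("0", "1"), pred = c("0", "1")))
oob_error <- (cm["0", "1"] + cm["1", "0"]) / sum(cm)
round(oob_error, 4)  # 0.1756, i.e., the reported 17.56%
```

The same arithmetic gives the per-class errors, e.g. 46 / (46 + 78) ≈ 0.371 for class 1, showing the model misses anxious caregivers far more often than non-anxious ones.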
#testing
anxiety_rf_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,  # Actual values of the outcome variable
  predict(anxiety_rf_fit, new_data = anxiety_test),  # Predicted class labels
  predict(anxiety_rf_fit, new_data = anxiety_test, type = "prob")  # Predicted probabilities
)
roc_auc(anxiety_rf_pred_values,
        truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.913
#roc_auc = 0.913
autoplot(roc_curve(anxiety_rf_pred_values, 
                   truth, 
                   .pred_0))

anxiety_rf_fit |>
  extract_fit_engine() |>
  importance()
                        0            1 MeanDecreaseAccuracy MeanDecreaseGini
cca11hwoftchs    8.772185  3.891666965             9.689340         8.535514
cca11hwoftshp    8.914634 -3.249922284             6.638716         6.684683
cca11hwoftott   12.840321 -5.755665519             8.992654         3.141887
che11enrgylmt   19.363815  9.067818407            20.062786         7.277269
cac11diffphy    17.117287 12.720020027            20.166989        10.559962
cac11exhaustd   20.112879 29.441965139            28.631686        25.802666
cac11diffemlv   12.791522  8.008534010            14.644963         7.984403
che11health     21.355923 -0.002229208            20.054725        12.325969
che11sleepint   16.698535 18.041547065            21.092565        16.517465
ew11progneed1    5.153517 -0.963817163             2.986744         2.364413
race_recode      7.494383 15.558971182            12.274605        10.123643
gender_recode    6.233240 -1.694217211             5.037235         2.821945
edu_recode       9.884316 -6.726172854             6.133143         5.495283
chd11dage       17.100443  7.939682130            18.405530        18.968925
martstat_recode  7.922483  4.958208183             9.229450         3.600878
anxiety_rf_fit |>
  extract_fit_engine() |>
  vip()

#20-fold cross validation on rf
anxiety_folds<-vfold_cv(anxiety_train, v=20)

rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(gad2_cg_cat ~ .)

rf_fit_cv_anxiety <- rf_workflow |>
  fit_resamples(resamples = anxiety_folds)

collect_metrics(rf_fit_cv_anxiety)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.818    20 0.0204  Preprocessor1_Model1
2 brier_class binary     0.116    20 0.00814 Preprocessor1_Model1
3 roc_auc     binary     0.900    20 0.0185  Preprocessor1_Model1
#roc_auc = 0.900

rf_wf_fit_cv_anxiety <- rf_workflow |>
  fit_resamples(
    resamples = anxiety_folds,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(rf_wf_fit_cv_anxiety)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.831    20 0.0193  Preprocessor1_Model1
2 brier_class binary     0.115    20 0.00812 Preprocessor1_Model1
3 roc_auc     binary     0.901    20 0.0182  Preprocessor1_Model1
rf_wf_fit_cv_anxiety |>
  collect_predictions() |>
  roc_curve(gad2_cg_cat, .pred_0) |>
  autoplot()

# Collect metrics from the resampling results
rf_wf_fit_cv_anxiety_metrics <- collect_metrics(rf_wf_fit_cv_anxiety)

# If you need predictions for further analysis
rf_wf_fit_cv_anxiety_preds <- collect_predictions(rf_wf_fit_cv_anxiety)

1.2.2.2 Depression

rf_spec<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

depression_rf_fit<-rf_spec|>
  fit(phq2_cg_cat ~ ., data=depression_train)

## top variables
depression_rf_fit|>
  extract_fit_engine()|>
  vip()

depression_rf_fit|>
  extract_fit_engine()|>
  importance()|>
  as.data.frame()|>
  arrange(desc(MeanDecreaseGini))
                        0           1 MeanDecreaseAccuracy MeanDecreaseGini
cac11exhaustd   10.941282 21.87749057           19.4956963        18.856521
chd11dage       13.285941 11.53229808           16.6250223        18.723552
race_recode      9.625158 20.82411940           16.0951259        13.651313
che11health     22.006975  5.86114401           21.9884565        13.309505
che11sleepint   11.304676  7.08106762           13.5276649        11.574621
che11enrgylmt   18.006647 16.14924317           21.7869158         9.246681
cac11diffemlv   15.117145  6.54537698           16.1678717         7.863624
cca11hwoftchs   10.572198 -0.65018253            9.5838436         7.742361
cca11hwoftshp   12.327146  0.26854045           11.5869052         7.370325
cac11diffphy    13.625711  3.43270118           13.9546213         7.292088
edu_recode       9.527513  0.86260389            9.3295516         6.003018
cca11hwoftott   13.232292  2.16717073           12.2727057         3.884083
martstat_recode  8.661539  0.03285823            8.2098021         3.408549
gender_recode   10.181899 -1.70204484            8.8680854         3.075003
ew11progneed1    2.552437 -1.80685567            0.8479132         1.885141
# 20-fold cross-validation for RF
depression_folds<-vfold_cv(depression_train, v=20)
depression_rf_workflow <- workflow() |>
  add_model(rf_spec) |>
  add_formula(phq2_cg_cat ~ .)

depression_rf_fit_cv <- depression_rf_workflow |>
  fit_resamples(depression_folds, control=control_resamples(save_pred=TRUE))
collect_metrics(depression_rf_fit_cv) 
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.811    20 0.0148  Preprocessor1_Model1
2 brier_class binary     0.122    20 0.00706 Preprocessor1_Model1
3 roc_auc     binary     0.875    20 0.0126  Preprocessor1_Model1
#roc_auc = 0.875
depression_rf_cv_preds<-collect_predictions(depression_rf_fit_cv)

depression_rf_cv_preds|>
  group_by(id)|>
  roc_auc(truth = phq2_cg_cat, .pred_0)
# A tibble: 20 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.856
 2 Fold02 roc_auc binary         0.810
 3 Fold03 roc_auc binary         0.856
 4 Fold04 roc_auc binary         0.844
 5 Fold05 roc_auc binary         0.882
 6 Fold06 roc_auc binary         0.929
 7 Fold07 roc_auc binary         0.882
 8 Fold08 roc_auc binary         0.889
 9 Fold09 roc_auc binary         0.967
10 Fold10 roc_auc binary         0.762
11 Fold11 roc_auc binary         0.894
12 Fold12 roc_auc binary         0.901
13 Fold13 roc_auc binary         0.941
14 Fold14 roc_auc binary         0.786
15 Fold15 roc_auc binary         0.918
16 Fold16 roc_auc binary         0.946
17 Fold17 roc_auc binary         0.877
18 Fold18 roc_auc binary         0.897
19 Fold19 roc_auc binary         0.778
20 Fold20 roc_auc binary         0.889
# Plot ROC curve
depression_rf_cv_preds |>
  group_by(id)|>
  roc_curve(truth = phq2_cg_cat, .pred_0) |>
  autoplot()

# Fit the random forest model on the full training data
depression_rf_fit <- rf_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)

depression_rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 18.22%
Confusion matrix:
    0  1 class.error
0 303 30  0.09009009
1  52 65  0.44444444
#testing
depression_rf_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat,  # Actual values of the outcome variable
  predict(depression_rf_fit, new_data = depression_test),  # Predicted class labels
  predict(depression_rf_fit, new_data = depression_test, type = "prob")  # Predicted probabilities
)
roc_auc(depression_rf_pred_values,
        truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.881
#roc_auc = 0.881

autoplot(roc_curve(depression_rf_pred_values, 
                   truth, 
                   .pred_0))

depression_rf_fit |>
  extract_fit_engine() |>
  importance()
                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
cca11hwoftchs    9.564661 -1.8043550          8.360324618         8.522971
cca11hwoftshp   11.156500 -1.6388262         10.487602039         7.259457
cca11hwoftott   13.401380  4.0767108         13.470714573         3.970576
che11enrgylmt   17.709253 14.7470929         21.190165886         9.277444
cac11diffphy    14.187913  3.7267181         14.385886477         7.175292
cac11exhaustd   11.259845 23.2929066         20.155748705        19.442645
cac11diffemlv   13.651184  6.2803699         14.657593829         8.109535
che11health     21.240807  9.4366175         22.756711524        13.224305
che11sleepint   10.480485  7.7604054         12.371038703        10.567631
ew11progneed1    1.689934 -1.7318822         -0.008375826         1.777307
race_recode      8.375363 22.1314620         15.322479947        13.500849
gender_recode    9.658573 -0.8184676          8.666887932         3.139169
edu_recode       9.223059 -0.2208215          8.648029917         5.787021
chd11dage       12.130080 12.6158467         15.550997941        19.622845
martstat_recode  8.449111 -0.3588132          7.915559517         3.289654
depression_rf_fit |>
  extract_fit_engine() |>
  vip()

rf_spec_depression<-rand_forest(trees=1000, min_n=5)|>
  set_engine("randomForest", importance=TRUE)|>
  set_mode("classification")

rf_fit_depression<-rf_spec_depression|>
  fit(phq2_cg_cat ~ ., data=depression_train)


#20-fold cross validation on rf
depression_folds<-vfold_cv(depression_train, v=20)

rf_workflow_depression <- workflow() |>
  add_model(rf_spec_depression) |>
  add_formula(phq2_cg_cat ~ .)

rf_fit_cv_depression <- rf_workflow_depression |>
  fit_resamples(resamples = depression_folds)

collect_metrics(rf_fit_cv_depression)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.816    20 0.0224  Preprocessor1_Model1
2 brier_class binary     0.122    20 0.00944 Preprocessor1_Model1
3 roc_auc     binary     0.872    20 0.0200  Preprocessor1_Model1
#roc_auc = 0.872

rf_wf_fit_cv_depression <- rf_workflow_depression |>
  fit_resamples(
    resamples = depression_folds,
    control = control_resamples(save_pred = TRUE)
  )

collect_metrics(rf_wf_fit_cv_depression)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.816    20 0.0209  Preprocessor1_Model1
2 brier_class binary     0.120    20 0.00908 Preprocessor1_Model1
3 roc_auc     binary     0.875    20 0.0195  Preprocessor1_Model1
rf_wf_fit_cv_depression |>
  collect_predictions() |>
  roc_curve(phq2_cg_cat, .pred_0) |>
  autoplot()


# If you need predictions for further analysis
rf_wf_fit_cv_depression_preds <- collect_predictions(rf_wf_fit_cv_depression)

1.2.3 Graphs

Anxiety

# ROC curve for the GLM refit on the test data (in-sample)
anxiety_roc_glm_training <- anxiety_glm_pred_values |> roc_curve(truth, .pred_0)
# ROC curve for the training-fit GLM evaluated on the test data
anxiety_roc_glm_cv <- anxiety.lr.pred.values.test |> roc_curve(truth, .pred_0)
# ROC curve for RF evaluated on the test data
anxiety_roc_rf_training <- anxiety_rf_pred_values|> roc_curve(truth, .pred_0)
# ROC curve for RF 20-fold cross-validation predictions
anxiety_roc_rf_cv <- rf_wf_fit_cv_anxiety_preds |>
  roc_curve(truth = gad2_cg_cat, .pred_0)


# Combine the curves with a model label, then map color/linetype to that
# label so ggplot builds a legend (geom_path has no `label` parameter, and
# `size` for lines is deprecated in favor of `linewidth`)
anxiety_roc_all <- bind_rows(
  `GLM (refit on test)`        = anxiety_roc_glm_training,
  `GLM (test)`                 = anxiety_roc_glm_cv,
  `Random Forest (test)`       = anxiety_roc_rf_training,
  `Random Forest (20-fold CV)` = anxiety_roc_rf_cv,
  .id = "model"
)

ggplot(anxiety_roc_all,
       aes(x = 1 - specificity, y = sensitivity,
           color = model, linetype = model)) +
  geom_path(linewidth = 1) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw() +
  labs(title = "Anxiety ROC Curves for Logistic Regression and Random Forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)",
       color = "Model",
       linetype = "Model") +
  theme(legend.position = "right")

Depression

library(yardstick)

# ROC curve for the GLM refit on the test data (in-sample)
depression_roc_glm_training <- depression_glm_pred_values |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for the training-fit GLM evaluated on the test data
depression_roc_glm_cv <- depression.lr.pred.values.test |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for RF evaluated on the test data
depression_roc_rf_training <- depression_rf_pred_values |>
  roc_curve(truth = truth, .pred_0)

# ROC curve for RF 20-fold cross-validation predictions
depression_roc_rf_cv <- depression_rf_cv_preds |>
  roc_curve(truth = phq2_cg_cat, .pred_0)


# As with anxiety: combine the curves with a model label and map
# color/linetype to it so the legend is generated automatically
depression_roc_all <- bind_rows(
  `GLM (refit on test)`        = depression_roc_glm_training,
  `GLM (test)`                 = depression_roc_glm_cv,
  `Random Forest (test)`       = depression_roc_rf_training,
  `Random Forest (20-fold CV)` = depression_roc_rf_cv,
  .id = "model"
)

ggplot(depression_roc_all,
       aes(x = 1 - specificity, y = sensitivity,
           color = model, linetype = model)) +
  geom_path(linewidth = 1) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw() +
  labs(title = "Depression ROC Curves for Logistic Regression and Random Forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)",
       color = "Model",
       linetype = "Model") +
  theme(legend.position = "right")

1.3 Conclusion

Both modeling approaches discriminated well between caregivers with and without probable anxiety and depression. For anxiety, 20-fold cross-validated ROC AUC was about 0.87 for the GLM and 0.90 for the random forest; for depression, about 0.83 and 0.87, respectively, with comparable performance on the held-out test sets. Across methods, caregiver exhaustion (cac11exhaustd) was the most consistent predictor, with care recipient age, sleep interruption, self-rated health, physical difficulty, race, and gender also contributing. These results support the premise laid out in the overview: social determinants and caregiving activities carry measurable signal for the mental health of caregivers of persons with dementia, and the reduced feature set identified here can guide confirmatory work with larger samples.